Evaluation of Poincare Embeddings

This notebook demonstrates how well Poincare embeddings perform on the tasks detailed in the original paper about the embeddings.

The following two external, open-source implementations are used -

C++
Numpy

This is the list of tasks -

WordNet reconstruction
WordNet link prediction
Link prediction in collaboration networks (evaluation incomplete)
Lexical entailment on HyperLex

A more detailed explanation of the tasks and the evaluation methodology is present in the individual evaluation subsections.

1. Setup

The following section performs the following -

Imports required python libraries and downloads the wordnet data
Clones the repositories containing the C++ and Numpy implementations of the Poincare embeddings
Applies patches containing minor changes to the implementations.
Compiles the C++ sources to create a binary



In [62]:

    
% cd ../..









    



/home/jayant/projects/gensim



In [64]:

    
% cd docs/notebooks/









    



/home/jayant/projects/gensim/docs/notebooks



In [63]:

    
import csv
from collections import OrderedDict
import logging
import os
import pickle
import random
import re

import click
from gensim.models.poincare import PoincareModel, PoincareRelations, \
    ReconstructionEvaluation, LinkPredictionEvaluation, \
    LexicalEntailmentEvaluation, PoincareKeyedVectors
from gensim.utils import check_output
import nltk
from prettytable import PrettyTable
from smart_open import smart_open

logging.basicConfig(level=logging.INFO)
nltk.download('wordnet')









    



[nltk_data] Downloading package wordnet to /home/jayant/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!






    Out[63]:





True

Note that not all the above libraries are part of the gensim dependencies, so they might need to be installed separately. These requirements are listed in the poincare requirements.txt

Please set the variable parent_directory below to change the directory to which the repositories are cloned.



In [65]:

    
current_directory = os.getcwd()



In [66]:

    
# Change this variable to `False` to not remove and re-download repos for external implementations
force_setup = False

# The poincare datasets, models and source code for external models are downloaded to this directory
parent_directory = os.path.join(current_directory, 'poincare')
! mkdir -p {parent_directory}



In [67]:

    
% cd {parent_directory}

# Clone repos
np_repo_name = 'poincare-np-embedding'
if force_setup and os.path.exists(np_repo_name):
    ! rm -rf {np_repo_name}
clone_np_repo = not os.path.exists(np_repo_name)
if clone_np_repo:
    ! git clone https://github.com/nishnik/poincare_embeddings.git {np_repo_name}

cpp_repo_name = 'poincare-cpp-embedding'
if force_setup and os.path.exists(cpp_repo_name):
    ! rm -rf {cpp_repo_name}
clone_cpp_repo = not os.path.exists(cpp_repo_name)
if clone_cpp_repo:
    ! git clone https://github.com/TatsuyaShirakawa/poincare-embedding.git {cpp_repo_name}

patches_applied = False









    



/home/jayant/projects/gensim/docs/notebooks/poincare



In [68]:

    
# Apply patches
if clone_cpp_repo and not patches_applied:
    % cd {cpp_repo_name}
    ! git apply ../poincare_burn_in_eps.patch

if clone_np_repo and not patches_applied:
    % cd ../{np_repo_name}
    ! git apply ../poincare_numpy.patch
    
patches_applied = True



In [69]:

    
# Compile the code for the external c++ implementation into a binary
% cd {parent_directory}/{cpp_repo_name}
! mkdir -p work
% cd work
! cmake ..
! make
% cd {current_directory}









    



/home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding
/home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding/work
-- Configuring done
-- Generating done
-- Build files have been written to: /home/jayant/projects/gensim/docs/notebooks/poincare/poincare-cpp-embedding/work
[100%] Built target poincare_embedding
/home/jayant/projects/gensim/docs/notebooks

You might need to install an updated version of cmake to be able to compile the source code. Please make sure that the binary poincare_embedding has been created before proceeding by verifying the above cell does not raise an error.



In [70]:

    
cpp_binary_path = os.path.join(parent_directory, cpp_repo_name, 'work', 'poincare_embedding')
assert(os.path.exists(cpp_binary_path)), 'Binary file doesnt exist at %s' % cpp_binary_path

2. Training

2.1 Create the data



In [71]:

    
# These directories are auto created in the current directory for storing poincare datasets and models
data_directory = os.path.join(parent_directory, 'data')
models_directory = os.path.join(parent_directory, 'models')

# Create directories
! mkdir -p {data_directory}
! mkdir -p {models_directory}



In [72]:

    
# Prepare the WordNet data
wordnet_file = os.path.join(data_directory, 'wordnet_noun_hypernyms.tsv')
if not os.path.exists(wordnet_file):
    ! python {parent_directory}/{cpp_repo_name}/scripts/create_wordnet_noun_hierarchy.py {wordnet_file}



In [73]:

    
# Prepare the HyperLex data
hyperlex_url = "http://people.ds.cam.ac.uk/iv250/paper/hyperlex/hyperlex-data.zip"
! wget {hyperlex_url} -O {data_directory}/hyperlex-data.zip
if os.path.exists(os.path.join(data_directory, 'hyperlex')):
    ! rm -r {data_directory}/hyperlex
! unzip {data_directory}/hyperlex-data.zip -d {data_directory}/hyperlex/
hyperlex_file = os.path.join(data_directory, 'hyperlex', 'nouns-verbs', 'hyperlex-nouns.txt')









    



--2017-11-14 11:15:54--  http://people.ds.cam.ac.uk/iv250/paper/hyperlex/hyperlex-data.zip
Resolving people.ds.cam.ac.uk (people.ds.cam.ac.uk)... 131.111.3.47
Connecting to people.ds.cam.ac.uk (people.ds.cam.ac.uk)|131.111.3.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 183900 (180K) [application/zip]
Saving to: ‘/home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex-data.zip’

/home/jayant/projec 100%[===================>] 179.59K  --.-KB/s    in 0.06s   

2017-11-14 11:15:54 (2.94 MB/s) - ‘/home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex-data.zip’ saved [183900/183900]

Archive:  /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex-data.zip
   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/nouns-verbs/
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/nouns-verbs/hyperlex-verbs.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/nouns-verbs/hyperlex-nouns.txt  
   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/
   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/hyperlex_training_all_random.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/hyperlex_test_all_random.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/random/hyperlex_dev_all_random.txt  
   creating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/hyperlex_dev_all_lexical.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/hyperlex_test_all_lexical.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/splits/lexical/hyperlex_training_all_lexical.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/hyperlex-all.txt  
  inflating: /home/jayant/projects/gensim/docs/notebooks/poincare/data/hyperlex/README.txt

2.2 Training C++ embeddings



In [74]:

    
def train_cpp_model(
    binary_path, data_file, output_file, dim, epochs, neg,
    num_threads, epsilon, burn_in, seed=0):
    """Train a poincare embedding using the c++ implementation
    
    Args:
        binary_path (str): Path to the compiled c++ implementation binary
        data_file (str): Path to tsv file containing relation pairs
        output_file (str): Path to output file containing model
        dim (int): Number of dimensions of the trained model
        epochs (int): Number of epochs to use
        neg (int): Number of negative samples to use
        num_threads (int): Number of threads to use for training the model
        epsilon (float): Constant used for clipping below a norm of one
        burn_in (int): Number of epochs to use for burn-in init (0 means no burn-in)
    
    Notes: 
        If `output_file` already exists, skips training
    """
    if os.path.exists(output_file):
        print('File %s exists, skipping' % output_file)
        return
    args = {
        'dim': dim,
        'max_epoch': epochs,
        'neg_size': neg,
        'num_thread': num_threads,
        'epsilon': epsilon,
        'burn_in': burn_in,
        'learning_rate_init': 0.1,
        'learning_rate_final': 0.0001,
    }
    cmd = [binary_path, data_file, output_file]
    for option, value in args.items():
        cmd.append("--%s" % option)
        cmd.append(str(value))
    
    return check_output(args=cmd)



In [75]:

    
model_sizes = [5, 10, 20, 50, 100, 200]
default_params = {
    'neg': 20,
    'epochs': 50,
    'threads': 8,
    'eps': 1e-6,
    'burn_in': 0,
    'batch_size': 10,
}

non_default_params = {
    'neg': [10],
    'epochs': [200],
    'burn_in': [10]
}



In [76]:

    
def cpp_model_name_from_params(params, prefix):
    param_keys = ['burn_in', 'epochs', 'neg', 'eps', 'threads']
    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]
    return '%s_%s' % (prefix, '_'.join(name))

def train_model_with_params(params, train_file, model_sizes, prefix, implementation):
    """Trains models with given params for multiple model sizes using the given implementation
    
    Args:
        params (dict): parameters to train the model with
        train_file (str): Path to tsv file containing relation pairs
        model_sizes (list): list of dimension sizes (integer) to train the model with
        prefix (str): prefix to use for the saved model filenames
        implementation (str): whether to use the numpy or c++ implementation,
                              allowed values: 'numpy', 'c++'
   
   Returns:
        tuple (model_name, model_files)
        model_files is a dict of (size, filename) pairs
        Example: ('cpp_model_epochs_50', {5: 'models/cpp_model_epochs_50_dim_5'})
    """
    files = {}
    if implementation == 'c++':
        model_name = cpp_model_name_from_params(params, prefix)
    elif implementation == 'numpy':
        model_name = np_model_name_from_params(params, prefix)
    elif implementation == 'gensim':
        model_name = gensim_model_name_from_params(params, prefix)
    else:
        raise ValueError('Given implementation %s not found' % implementation)
    for model_size in model_sizes:
        output_file_name = '%s_dim_%d' % (model_name, model_size)
        output_file = os.path.join(models_directory, output_file_name)
        print('Training model %s of size %d' % (model_name, model_size))
        if implementation == 'c++':
            out = train_cpp_model(
                cpp_binary_path, train_file, output_file, model_size,
                params['epochs'], params['neg'], params['threads'],
                params['eps'], params['burn_in'], seed=0)
        elif implementation == 'numpy':
            train_external_numpy_model(
                python_script_path, train_file, output_file, model_size,
                params['epochs'], params['neg'], seed=0)
        elif implementation == 'gensim':
            train_gensim_model(
                train_file, output_file, model_size, params['epochs'],
                params['neg'], params['burn_in'], params['batch_size'], seed=0)
        else:
            raise ValueError('Given implementation %s not found' % implementation)
        files[model_size] = output_file
    return (model_name, files)



In [77]:

    
model_files = {}



In [ ]:

    
model_files['c++'] = {}
# Train c++ models with default params
model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'cpp_model', 'c++')
model_files['c++'][model_name] = {}
for dim, filepath in files.items():
    model_files['c++'][model_name][dim] = filepath
# Train c++ models with non-default params
for param, values in non_default_params.items():
    params = default_params.copy()
    for value in values:
        params[param] = value
        model_name, files = train_model_with_params(params, wordnet_file, model_sizes, 'cpp_model', 'c++')
        model_files['c++'][model_name] = {}
        for dim, filepath in files.items():
            model_files['c++'][model_name][dim] = filepath

2.3 Training numpy embeddings (non-gensim)



In [79]:

    
python_script_path = os.path.join(parent_directory, np_repo_name, 'poincare.py')



In [80]:

    
def np_model_name_from_params(params, prefix):
    param_keys = ['neg', 'epochs']
    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]
    return '%s_%s' % (prefix, '_'.join(name))

def train_external_numpy_model(
    script_path, data_file, output_file, dim, epochs, neg, seed=0):
    """Train a poincare embedding using an external numpy implementation
    
    Args:
        script_path (str): Path to the Python training script
        data_file (str): Path to tsv file containing relation pairs
        output_file (str): Path to output file containing model
        dim (int): Number of dimensions of the trained model
        epochs (int): Number of epochs to use
        neg (int): Number of negative samples to use
    
    Notes: 
        If `output_file` already exists, skips training
    """
    if os.path.exists(output_file):
        print('File %s exists, skipping' % output_file)
        return
    args = {
        'input-file': data_file,
        'output-file': output_file,
        'dimensions': dim,
        'epochs': epochs,
        'learning-rate': 0.01,
        'num-negative': neg,
    }
    cmd = ['python', script_path]
    for option, value in args.items():
        cmd.append("--%s" % option)
        cmd.append(str(value))
    
    return check_output(args=cmd)



In [ ]:

    
model_files['numpy'] = {}
# Train models with default params
model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'np_model', 'numpy')
model_files['numpy'][model_name] = {}
for dim, filepath in files.items():
    model_files['numpy'][model_name][dim] = filepath

2.4 Training gensim embeddings



In [82]:

    
def gensim_model_name_from_params(params, prefix):
    param_keys = ['neg', 'epochs', 'burn_in', 'batch_size']
    name = ['%s_%s' % (key, params[key]) for key in sorted(param_keys)]
    return '%s_%s' % (prefix, '_'.join(name))

def train_gensim_model(
    data_file, output_file, dim, epochs, neg, burn_in, batch_size, seed=0):
    """Train a poincare embedding using gensim implementation
    
    Args:
        data_file (str): Path to tsv file containing relation pairs
        output_file (str): Path to output file containing model
        dim (int): Number of dimensions of the trained model
        epochs (int): Number of epochs to use
        neg (int): Number of negative samples to use
        burn_in (int): Number of epochs to use for burn-in initialization
        batch_size (int): Size of batch to use for training
    
    Notes: 
        If `output_file` already exists, skips training
    """
    if os.path.exists(output_file):
        print('File %s exists, skipping' % output_file)
        return
    train_data = PoincareRelations(data_file)
    model = PoincareModel(train_data, size=dim, negative=neg, burn_in=burn_in)
    model.train(epochs=epochs, batch_size=batch_size)
    model.save(output_file)



In [83]:

    
non_default_params_gensim = {
    'neg': [10],
    'burn_in': [10],
    'batch_size': [50]
}



In [ ]:

    
model_files['gensim'] = {}
# Train models with default params
model_name, files = train_model_with_params(default_params, wordnet_file, model_sizes, 'gensim_model', 'gensim')
model_files['gensim'][model_name] = {}
for dim, filepath in files.items():
    model_files['gensim'][model_name][dim] = filepath
# Train models with non-default params
for param, values in non_default_params_gensim.items():
    params = default_params.copy()
    for value in values:
        params[param] = value
        model_name, files = train_model_with_params(params, wordnet_file, model_sizes, 'gensim_model', 'gensim')
        model_files['gensim'][model_name] = {}
        for dim, filepath in files.items():
            model_files['gensim'][model_name][dim] = filepath

3. Loading the embeddings



In [86]:

    
def transform_cpp_embedding_to_kv(input_file, output_file, encoding='utf8'):
    """Given a C++ embedding tsv filepath, converts it to a KeyedVector-supported file"""
    with smart_open(input_file, 'rb') as f:
        lines = [line.decode(encoding) for line in f]
    if not len(lines):
         raise ValueError("file is empty")
    first_line = lines[0]
    parts = first_line.rstrip().split("\t")
    model_size = len(parts) - 1
    vocab_size = len(lines)
    with open(output_file, 'w') as f:
        f.write('%d %d\n' % (vocab_size, model_size))
        for line in lines:
            f.write(line.replace('\t', ' '))

def transform_numpy_embedding_to_kv(input_file, output_file, encoding='utf8'):
    """Given a numpy poincare embedding pkl filepath, converts it to a KeyedVector-supported file"""
    np_embeddings = pickle.load(open(input_file, 'rb'))
    random_embedding = np_embeddings[list(np_embeddings.keys())[0]]
    
    model_size = random_embedding.shape[0]
    vocab_size = len(np_embeddings)
    with open(output_file, 'w') as f:
        f.write('%d %d\n' % (vocab_size, model_size))
        for key, vector in np_embeddings.items():
            vector_string = ' '.join('%.6f' % value for value in vector)
            f.write('%s %s\n' % (key, vector_string))

def load_poincare_cpp(input_filename):
    """Load embedding trained via C++ Poincare model.

    Parameters
    ----------
    filepath : str
        Path to tsv file containing embedding.

    Returns
    -------
    PoincareKeyedVectors instance.

    """
    keyed_vectors_filename = input_filename + '.kv'
    transform_cpp_embedding_to_kv(input_filename, keyed_vectors_filename)
    embedding = PoincareKeyedVectors.load_word2vec_format(keyed_vectors_filename)
    os.unlink(keyed_vectors_filename)
    return embedding

def load_poincare_numpy(input_filename):
    """Load embedding trained via Python numpy Poincare model.

    Parameters
    ----------
    filepath : str
        Path to pkl file containing embedding.

    Returns:
        PoincareKeyedVectors instance.

    """
    keyed_vectors_filename = input_filename + '.kv'
    transform_numpy_embedding_to_kv(input_filename, keyed_vectors_filename)
    embedding = PoincareKeyedVectors.load_word2vec_format(keyed_vectors_filename)
    os.unlink(keyed_vectors_filename)
    return embedding

def load_poincare_gensim(input_filename):
    """Load embedding trained via Gensim PoincareModel.

    Parameters
    ----------
    filepath : str
        Path to model file.

    Returns:
        PoincareKeyedVectors instance.

    """
    model = PoincareModel.load(input_filename)
    return model.kv

def load_model(implementation, model_file):
    """Convenience function over functions to load models from different implementations.
    
    Parameters
    ----------
    implementation : str
        Implementation used to create model file ('c++'/'numpy'/'gensim').
    model_file : str
        Path to model file.
    
    Returns
    -------
    PoincareKeyedVectors instance
    
    Notes
    -----
    Raises ValueError in case of invalid value for `implementation`

    """
    if implementation == 'c++':
        return load_poincare_cpp(model_file)
    elif implementation == 'numpy':
        return load_poincare_numpy(model_file)
    elif implementation == 'gensim':
        return load_poincare_gensim(model_file)
    else:
        raise ValueError('Invalid implementation %s' % implementation)

4. Evaluation



In [87]:

    
def display_results(task_name, results):
    """Display evaluation results of multiple embeddings on a single task in a tabular format
    
    Args:
        task_name (str): name the task being evaluated
        results (dict): mapping between embeddings and corresponding results
    
    """
    data = PrettyTable()
    data.field_names = ["Model Description", "Metric"] + [str(dim) for dim in sorted(model_sizes)]
    for model_name, model_results in results.items():
        metrics = [metric for metric in model_results.keys()]
        dims = sorted([dim for dim in model_results[metrics[0]].keys()])
        row = [model_name, '\n'.join(metrics) + '\n']
        for dim in dims:
            scores = ['%.2f' % model_results[metric][dim] for metric in metrics]
            row.append('\n'.join(scores))
        data.add_row(row)
    data.align = 'r'
    data_cols = data.get_string().split('\n')[0].split('+')[1:-1]
    col_lengths = [len(col) for col in data_cols]
    header_col_1_length = col_lengths[0] + col_lengths[1] - 1
    header_col_2_length = sum(col_lengths[2:]) + len(col_lengths[2:-1]) - 2
    
    header_col_2_content = "Model Dimensions"
    header_col_2_left_margin = (header_col_2_length - len(header_col_2_content)) // 2
    header_col_2_right_margin = header_col_2_length - len(header_col_2_content) - header_col_2_left_margin
    header_col_2_string = "%s%s%s" % (
        " " * header_col_2_left_margin, header_col_2_content, " " * header_col_2_right_margin)
    header = PrettyTable()
    header.field_names = [" " * header_col_1_length, header_col_2_string]
    header_lines = header.get_string(start=0, end=0).split("\n")[:2]
    print('Results for %s task' % task_name)
    print("\n".join(header_lines))
    print(data)

4.1 WordNet reconstruction



In [88]:

    
reconstruction_results = OrderedDict()
metrics = ['mean_rank', 'MAP']



In [ ]:

    
for implementation, models in sorted(model_files.items()):
    for model_name, files in models.items():
        if model_name in reconstruction_results:
            continue
        reconstruction_results[model_name] = OrderedDict()
        for metric in metrics:
            reconstruction_results[model_name][metric] = {}
        for model_size, model_file in files.items():
            print('Evaluating model %s of size %d' % (model_name, model_size))
            embedding = load_model(implementation, model_file)
            eval_instance = ReconstructionEvaluation(wordnet_file, embedding)
            eval_result = eval_instance.evaluate(max_n=1000)
            for metric in metrics:
                reconstruction_results[model_name][metric][model_size] = eval_result[metric]



In [53]:

    
display_results('WordNet Reconstruction', reconstruction_results)









    



Results for WordNet Reconstruction task
+-----------------------------------------------------------------------+----------------------------------------------------------+
|                                                                       |                     Model Dimensions                     |
+-----------------------------------------------------------+-----------+----------+---------+---------+---------+--------+--------+
|                                         Model Description |    Metric |        5 |      10 |      20 |      50 |    100 |    200 |
+-----------------------------------------------------------+-----------+----------+---------+---------+---------+--------+--------+
| cpp_model_burn_in_10_epochs_50_eps_1e-06_neg_20_threads_8 | mean_rank |   252.86 |  195.73 |  182.57 |  165.33 | 157.37 | 155.78 |
|                                                           |       MAP |     0.26 |    0.32 |    0.34 |    0.36 |   0.36 |   0.36 |
|                                                           |           |          |         |         |         |        |        |
|  cpp_model_burn_in_0_epochs_50_eps_1e-06_neg_20_threads_8 | mean_rank |   265.72 |  116.94 |   90.81 |   59.47 |  55.14 |  54.31 |
|                                                           |       MAP |     0.28 |    0.41 |    0.49 |    0.56 |   0.58 |   0.59 |
|                                                           |           |          |         |         |         |        |        |
|  cpp_model_burn_in_0_epochs_50_eps_1e-06_neg_10_threads_8 | mean_rank |   280.17 |  129.46 |   92.06 |   80.41 |  71.42 |  69.30 |
|                                                           |       MAP |     0.27 |    0.40 |    0.49 |    0.53 |   0.56 |   0.56 |
|                                                           |           |          |         |         |         |        |        |
| cpp_model_burn_in_0_epochs_200_eps_1e-06_neg_20_threads_8 | mean_rank |   191.69 |   97.65 |   72.07 |   55.48 |  46.76 |  49.62 |
|                                                           |       MAP |     0.34 |    0.43 |    0.51 |    0.57 |   0.59 |   0.59 |
|                                                           |           |          |         |         |         |        |        |
|     gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20 | mean_rank |   154.41 |   62.77 |   27.32 |   20.22 |  16.15 |  13.20 |
|                                                           |       MAP |     0.40 |    0.63 |    0.72 |    0.77 |   0.78 |   0.79 |
|                                                           |           |          |         |         |         |        |        |
|     gensim_model_batch_size_50_burn_in_0_epochs_50_neg_20 | mean_rank |   148.51 |   63.67 |   28.36 |   20.23 |  15.75 |  13.59 |
|                                                           |       MAP |     0.38 |    0.62 |    0.72 |    0.76 |   0.78 |   0.79 |
|                                                           |           |          |         |         |         |        |        |
|    gensim_model_batch_size_10_burn_in_10_epochs_50_neg_20 | mean_rank |   108.01 |  100.73 |   97.38 |   94.49 |  94.68 |  89.66 |
|                                                           |       MAP |     0.37 |    0.47 |    0.48 |    0.49 |   0.48 |   0.49 |
|                                                           |           |          |         |         |         |        |        |
|     gensim_model_batch_size_10_burn_in_0_epochs_50_neg_10 | mean_rank |   211.71 |   54.42 |   24.90 |   21.42 |  15.80 |  15.13 |
|                                                           |       MAP |     0.33 |    0.60 |    0.72 |    0.76 |   0.78 |   0.79 |
|                                                           |           |          |         |         |         |        |        |
|                                 np_model_epochs_50_neg_20 | mean_rank | 10169.24 | 5602.84 | 3631.57 | 1088.31 | 537.43 | 345.45 |
|                                                           |       MAP |     0.14 |    0.16 |    0.20 |    0.25 |   0.30 |   0.36 |
|                                                           |           |          |         |         |         |        |        |
+-----------------------------------------------------------+-----------+----------+---------+---------+---------+--------+--------+

4.2 WordNet link prediction

4.2.1 Preparing data



In [89]:

    
def train_test_split(data_file, test_ratio=0.1):
    """Creates train and test files from given data file, returns train/test file names
    
    Args:
        data_file (str): path to data file for which train/test split is to be created
        test_ratio (float): fraction of lines to be used for test data
    
    Returns
        (train_file, test_file): tuple of strings with train file and test file paths
    """
    train_filename = data_file + '.train'
    test_filename = data_file + '.test'
    if os.path.exists(train_filename) and os.path.exists(test_filename):
        print('Train and test files already exist, skipping')
        return (train_filename, test_filename)
    root_nodes, leaf_nodes = get_root_and_leaf_nodes(data_file)
    test_line_candidates = []
    line_count = 0
    all_nodes = set()
    with open(data_file, 'rb') as f:
        for i, line in enumerate(f):
            node_1, node_2 = line.split()
            all_nodes.update([node_1, node_2])
            if (
                    node_1 not in leaf_nodes
                    and node_2 not in leaf_nodes
                    and node_1 not in root_nodes
                    and node_2 not in root_nodes
                    and node_1 != node_2
                ):
                test_line_candidates.append(i)
            line_count += 1

    num_test_lines = int(test_ratio * line_count)
    if num_test_lines > len(test_line_candidates):
        raise ValueError('Not enough candidate relations for test set')
    print('Choosing %d test lines from %d candidates' % (num_test_lines, len(test_line_candidates)))
    test_line_indices = set(random.sample(test_line_candidates, num_test_lines))
    train_line_indices = set(l for l in range(line_count) if l not in test_line_indices)
    
    train_set_nodes = set()
    with open(data_file, 'rb') as f:
        train_file = open(train_filename, 'wb')
        test_file = open(test_filename, 'wb')
        for i, line in enumerate(f):
            if i in train_line_indices:
                train_set_nodes.update(line.split())
                train_file.write(line)
            elif i in test_line_indices:
                test_file.write(line)
            else:
                raise AssertionError('Line %d not present in either train or test line indices' % i)
        train_file.close()
        test_file.close()
    assert len(train_set_nodes) == len(all_nodes), 'Not all nodes from dataset present in train set relations'
    return (train_filename, test_filename)



In [90]:

    
def get_root_and_leaf_nodes(data_file):
    """Return keys of root and leaf nodes from a file with transitive closure relations
    
    Args:
        data_file(str): file path containing transitive closure relations
    
    Returns:
        (root_nodes, leaf_nodes) - tuple containing keys of root and leaf nodes
    """
    root_candidates = set()
    leaf_candidates = set()
    with open(data_file, 'rb') as f:
        for line in f:
            nodes = line.split()
            root_candidates.update(nodes)
            leaf_candidates.update(nodes)
    
    with open(data_file, 'rb') as f:
        for line in f:
            node_1, node_2 = line.split()
            if node_1 == node_2:
                continue
            leaf_candidates.discard(node_1)
            root_candidates.discard(node_2)
    
    return (leaf_candidates, root_candidates)



In [91]:

    
wordnet_train_file, wordnet_test_file = train_test_split(wordnet_file)









    



Train and test files already exist, skipping

4.2.2 Training models



In [92]:

    
# Training models for link prediction
lp_model_files = {}



In [ ]:

    
lp_model_files['c++'] = {}
# Train c++ models with default params
model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'cpp_lp_model', 'c++')
lp_model_files['c++'][model_name] = {}
for dim, filepath in files.items():
    lp_model_files['c++'][model_name][dim] = filepath
# Train c++ models with non-default params
for param, values in non_default_params.items():
    params = default_params.copy()
    for value in values:
        params[param] = value
        model_name, files = train_model_with_params(params, wordnet_train_file, model_sizes, 'cpp_lp_model', 'c++')
        lp_model_files['c++'][model_name] = {}
        for dim, filepath in files.items():
            lp_model_files['c++'][model_name][dim] = filepath



In [ ]:

    
lp_model_files['numpy'] = {}
# Train numpy models with default params
model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'np_lp_model', 'numpy')
lp_model_files['numpy'][model_name] = {}
for dim, filepath in files.items():
    lp_model_files['numpy'][model_name][dim] = filepath



In [ ]:

    
lp_model_files['gensim'] = {}
# Train models with default params
model_name, files = train_model_with_params(default_params, wordnet_train_file, model_sizes, 'gensim_lp_model', 'gensim')
lp_model_files['gensim'][model_name] = {}
for dim, filepath in files.items():
    lp_model_files['gensim'][model_name][dim] = filepath
# Train models with non-default params
for param, values in non_default_params_gensim.items():
    params = default_params.copy()
    for value in values:
        params[param] = value
        model_name, files = train_model_with_params(params, wordnet_train_file, model_sizes, 'gensim_lp_model', 'gensim')
        lp_model_files['gensim'][model_name] = {}
        for dim, filepath in files.items():
            lp_model_files['gensim'][model_name][dim] = filepath

4.2.3 Evaluating models



In [96]:

    
lp_results = OrderedDict()
metrics = ['mean_rank', 'MAP']



In [ ]:

    
for implementation, models in sorted(lp_model_files.items()):
    for model_name, files in models.items():
        lp_results[model_name] = OrderedDict()
        for metric in metrics:
            lp_results[model_name][metric] = {}
        for model_size, model_file in files.items():
            print('Evaluating model %s of size %d' % (model_name, model_size))
            embedding = load_model(implementation, model_file)
            eval_instance = LinkPredictionEvaluation(wordnet_train_file, wordnet_test_file, embedding)
            eval_result = eval_instance.evaluate(max_n=1000)
            for metric in metrics:
                lp_results[model_name][metric][model_size] = eval_result[metric]



In [105]:

    
display_results('WordNet Link Prediction', lp_results)









    



Results for WordNet Link Prediction task
+--------------------------------------------------------------------------+------------------------------------------------------------+
|                                                                          |                      Model Dimensions                      |
+--------------------------------------------------------------+-----------+----------+---------+---------+---------+---------+---------+
|                                            Model Description |    Metric |        5 |      10 |      20 |      50 |     100 |     200 |
+--------------------------------------------------------------+-----------+----------+---------+---------+---------+---------+---------+
| cpp_lp_model_burn_in_0_epochs_200_eps_1e-06_neg_20_threads_8 | mean_rank |   218.26 |   99.09 |   60.50 |   52.24 |   60.81 |   69.13 |
|                                                              |       MAP |     0.15 |    0.24 |    0.31 |    0.35 |    0.36 |    0.36 |
|                                                              |           |          |         |         |         |         |         |
|  cpp_lp_model_burn_in_0_epochs_50_eps_1e-06_neg_10_threads_8 | mean_rank |   230.34 |  123.24 |   75.62 |   65.97 |   55.33 |   56.89 |
|                                                              |       MAP |     0.14 |    0.22 |    0.28 |    0.31 |    0.33 |    0.34 |
|                                                              |           |          |         |         |         |         |         |
|  cpp_lp_model_burn_in_0_epochs_50_eps_1e-06_neg_20_threads_8 | mean_rank |   687.48 |  281.88 |   72.95 |   57.37 |   52.56 |   61.42 |
|                                                              |       MAP |     0.12 |    0.15 |    0.31 |    0.35 |    0.36 |    0.36 |
|                                                              |           |          |         |         |         |         |         |
| cpp_lp_model_burn_in_10_epochs_50_eps_1e-06_neg_20_threads_8 | mean_rank |   236.31 |  214.85 |  193.30 |  180.27 |  169.00 |  163.22 |
|                                                              |       MAP |     0.10 |    0.13 |    0.14 |    0.15 |    0.16 |    0.16 |
|                                                              |           |          |         |         |         |         |         |
|     gensim_lp_model_batch_size_10_burn_in_0_epochs_50_neg_10 | mean_rank |   141.52 |   58.89 |   31.66 |   22.13 |   21.29 |   19.38 |
|                                                              |       MAP |     0.18 |    0.34 |    0.46 |    0.51 |    0.52 |    0.53 |
|                                                              |           |          |         |         |         |         |         |
|     gensim_lp_model_batch_size_10_burn_in_0_epochs_50_neg_20 | mean_rank |   121.42 |   52.51 |   24.61 |   19.96 |   20.44 |   19.55 |
|                                                              |       MAP |     0.19 |    0.37 |    0.46 |    0.52 |    0.50 |    0.54 |
|                                                              |           |          |         |         |         |         |         |
|    gensim_lp_model_batch_size_10_burn_in_10_epochs_50_neg_20 | mean_rank |   154.95 |  138.12 |  122.06 |  117.96 |  112.99 |  110.84 |
|                                                              |       MAP |     0.16 |    0.21 |    0.24 |    0.26 |    0.25 |    0.26 |
|                                                              |           |          |         |         |         |         |         |
|     gensim_lp_model_batch_size_50_burn_in_0_epochs_50_neg_20 | mean_rank |   144.19 |   53.65 |   25.21 |   20.68 |   21.32 |   18.97 |
|                                                              |       MAP |     0.19 |    0.35 |    0.47 |    0.52 |    0.51 |    0.53 |
|                                                              |           |          |         |         |         |         |         |
|                                 np_lp_model_epochs_50_neg_20 | mean_rank | 14100.99 | 7538.02 | 5297.38 | 2365.27 | 1552.72 | 1741.87 |
|                                                              |       MAP |     0.00 |    0.02 |    0.03 |    0.08 |    0.10 |    0.13 |
|                                                              |           |          |         |         |         |         |         |
+--------------------------------------------------------------+-----------+----------+---------+---------+---------+---------+---------+

4.3 HyperLex Lexical Entailment



In [49]:

    
entailment_results = OrderedDict()
eval_instance = LexicalEntailmentEvaluation(hyperlex_file)



In [ ]:

    
for implementation, models in sorted(model_files.items()):
    for model_name, files in models.items():
        if model_name in entailment_results:
            continue
        entailment_results[model_name] = OrderedDict()
        entailment_results[model_name]['spearman'] = {}
        for model_size, model_file in files.items():
            print('Evaluating model %s of size %d' % (model_name, model_size))
            embedding = load_model(implementation, model_file)
            entailment_results[model_name]['spearman'][model_size] = eval_instance.evaluate_spearman(embedding)



In [34]:

    
display_results('Lexical Entailment (HyperLex)', entailment_results)









    



Results for Lexical Entailment (HyperLex) task
+----------------------------------------------------------------------+-----------------------------------------+
|                                                                      |            Model Dimensions             |
+-----------------------------------------------------------+----------+------+------+------+------+------+------+
|                                         Model Description |   Metric |    5 |   10 |   20 |   50 |  100 |  200 |
+-----------------------------------------------------------+----------+------+------+------+------+------+------+
|  cpp_model_burn_in_0_epochs_50_eps_1e-06_neg_20_threads_8 | spearman | 0.45 | 0.44 | 0.47 | 0.44 | 0.46 | 0.45 |
|                                                           |          |      |      |      |      |      |      |
| cpp_model_burn_in_10_epochs_50_eps_1e-06_neg_20_threads_8 | spearman | 0.42 | 0.42 | 0.44 | 0.44 | 0.44 | 0.45 |
|                                                           |          |      |      |      |      |      |      |
|  cpp_model_burn_in_0_epochs_50_eps_1e-06_neg_10_threads_8 | spearman | 0.42 | 0.43 | 0.44 | 0.43 | 0.44 | 0.44 |
|                                                           |          |      |      |      |      |      |      |
| cpp_model_burn_in_0_epochs_200_eps_1e-06_neg_20_threads_8 | spearman | 0.45 | 0.46 | 0.45 | 0.46 | 0.45 | 0.47 |
|                                                           |          |      |      |      |      |      |      |
|     gensim_model_batch_size_10_burn_in_0_epochs_50_neg_10 | spearman | 0.46 | 0.47 | 0.46 | 0.47 | 0.48 | 0.48 |
|                                                           |          |      |      |      |      |      |      |
|     gensim_model_batch_size_50_burn_in_0_epochs_50_neg_20 | spearman | 0.47 | 0.47 | 0.47 | 0.47 | 0.48 | 0.48 |
|                                                           |          |      |      |      |      |      |      |
|    gensim_model_batch_size_10_burn_in_10_epochs_50_neg_20 | spearman | 0.45 | 0.46 | 0.45 | 0.46 | 0.45 | 0.46 |
|                                                           |          |      |      |      |      |      |      |
|     gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20 | spearman | 0.47 | 0.46 | 0.47 | 0.48 | 0.48 | 0.48 |
|                                                           |          |      |      |      |      |      |      |
|                                 np_model_epochs_50_neg_20 | spearman | 0.15 | 0.20 | 0.21 | 0.21 | 0.25 | 0.27 |
|                                                           |          |      |      |      |      |      |      |
+-----------------------------------------------------------+----------+------+------+------+------+------+------+

4.4 Link Prediction for collaboration networks



In [68]:

    
# TODO - quite tricky, since the loss function used for training the model on this network is different
# Will require changes to how gradients are calculated in C++ code